Abstract

We introduce a conceptually novel and powerful technique
to achieve fault tolerance in hardware and soft-
ware systems. When used for software fault tolerance, 
this new technique uses time and software redundancy 
and can be outlined as follows. In the initial phase, 
a program is run to solve a problem and store the re- 
sult. In addition, this program leaves behind a trail of 
data which we call a certification trail. In the second 
phase, another program is run which solves the origi- 
nal problem again. This program, however, has access 
to the certification trail left by the first program. Be- 
cause of the availability of the certification trail, the 
second phase can be performed by a less complex pro- 
gram and can execute more quickly. In the final phase, 
the two results are compared and if they agree the re- 
sults are accepted as correct; otherwise an error is indi- 
cated. An essential aspect of this approach is that the 
second program must always generate either an error 
indication or a correct output even when the certifica- 
tion trail it receives from the first program is incorrect. 
We formalize the certification trail approach to fault 
tolerance and illustrate it by applying it to the funda- 
mental problem of finding a minimum spanning tree. 
We discuss cases in which the second phase can be 
run concurrently with the first and act as a monitor. 
We compare the certification trail approach to other 
approaches to fault tolerance. Because of space limitations
we have omitted examples of our technique
applied to the Huffman tree and convex hull problems.
These can be found in the full version of this paper. 

1 Introduction 

In this paper we introduce a novel and powerful technique
for achieving fault tolerance in systems. Although
the technique is applicable to both hardware and software,
we restrict our discussion in the following
to software fault tolerance. To explain our new
technique for software fault tolerance, we will first dis-
cuss a simpler fault tolerant software method. In this 
method the specification of a problem is given and an 
algorithm to solve it is constructed. This algorithm is 
executed on an input and the output is stored. Next, 
the same algorithm is executed again on the same in- 
put and the output is compared to the earlier output. 
If the outputs differ then an error is indicated, oth- 
erwise the output is accepted as correct. This soft- 
ware fault tolerance method requires additional time, 
so-called time redundancy [14, 22]; however, it requires
no additional software. It is particularly valuable for 
detecting errors caused by transient fault phenomena. 
If such faults cause an error during only one of the executions
then either the error will be detected or the
output will be correct. 
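
The simple time redundancy method just described can be sketched as follows. This is a minimal illustration rather than code from the paper; the names `run_with_time_redundancy` and `ERROR` are hypothetical.

```python
from typing import Callable, TypeVar, Union

D = TypeVar("D")
S = TypeVar("S")

ERROR = "error"  # hypothetical sentinel meaning "a mismatch was detected"


def run_with_time_redundancy(algorithm: Callable[[D], S], d: D) -> Union[S, str]:
    """Execute the same algorithm twice on the same input and compare."""
    first = algorithm(d)   # first execution; its output is stored
    second = algorithm(d)  # second execution on the same input
    # Accept the output only if both executions agree.
    return first if first == second else ERROR
```

If a transient fault corrupts only one of the two executions, the outputs differ and the sentinel is returned; if neither is corrupted, the common output is returned.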

A variation of the above method uses two separate 
algorithms, one for each execution, which have been 
written independently based on the problem specification.
This technique, called N-version programming [8, 4]
(in this case N=2), allows for the detection of errors
caused by some faults in the software in addition to 
those caused by transient hardware faults and utilizes 
both time and software redundancy. Errors caused 
by software faults are detected whenever the indepen- 
dently written programs do not generate coincident 
errors. 

The technique we will describe is designed to achieve 
similar types of error detection capabilities but expend 
fewer resources. The central idea, as illustrated in Fig- 
ure 1, is to modify the first algorithm so that it leaves 
behind a trail of data which we call a certification trail. 
This data is chosen so that it can allow the second
algorithm to execute more quickly and/or have a
simpler structure than the first algorithm. As above, 
the outputs of the two executions are compared and 
are considered correct only if they agree. Note, however,
we must be careful in defining this method or
else its error detection capability might be reduced 
by the introduction of data dependency between the 
two algorithm executions. For example, suppose the 
first algorithm execution contains an error which causes
an incorrect output and an incorrect trail of data to
be generated. Further suppose that no error occurs
during the execution of the second algorithm. It still
appears possible that the execution of the second algorithm
might use the incorrect trail to generate an
incorrect output which matches the incorrect output
given by the execution of the first algorithm. Intuitively,
the second execution would be “fooled” by the
data left behind by the first execution. The definitions
we give below exclude this possibility. They demand
that the second execution either generates a correct
answer or signals the fact that an error has been detected
in the data trail. Finally, it should be noted that
in Figure 1 both executions can signal an error. These
errors would include run-time errors such as divide-by-zero
or non-terminating computation. In addition the
second execution can signal an error due to an incorrect
certification trail.

Figure 1: Certification trail method.


2 Formal Definition of a Certification Trail

In this section we will give a formal definition of a 
certification trail and discuss some aspects of its real- 
izations and uses. 

Definition 2.1 A problem P is formalized as a relation
(that is, a set of ordered pairs). Let D be the
domain (that is, the set of inputs) of the relation P
and let S be the range (that is, the set of solutions)
for the problem. We say an algorithm A solves a problem
P iff for all $d \in D$ when d is input to A then an
$s \in S$ is output such that $(d, s) \in P$.

Definition 2.2 Let $P : D \to S$ be a problem. Let
T be the set of certification trails. A solution to this
problem using a certification trail consists of two functions
$F_1$ and $F_2$ with the following domains and ranges:
$F_1 : D \to S \times T$ and $F_2 : D \times T \to S \cup \{\text{error}\}$. The
functions must satisfy the following two properties:

(1) for all $d \in D$ there exists $s \in S$ and
there exists $t \in T$ such that
$F_1(d) = (s, t)$ and $F_2(d, t) = s$ and $(d, s) \in P$

(2) for all $d \in D$ and for all $t \in T$
either ($F_2(d, t) = s$ and $(d, s) \in P$) or
$F_2(d, t) = \text{error}$.


The definitions above assure that the error detec- 
tion capability of the certification trail approach is 
comparable to that obtained with the simple time redundancy
approach discussed earlier. That is, if transient
hardware faults occur during only one of the executions
then either an error will be detected or the
output will be correct. It should be further noted,
however, that the examples to be considered will indicate
that this new approach can also save overall execution
time.
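
To make the two properties of Definition 2.2 concrete, the sketch below instantiates $F_1$ and $F_2$ for a toy problem that is not treated in this paper: sorting a list, with the sorting permutation as the certification trail. All names (`f1_sort`, `f2_sort`, `certified_run`, `ERROR`) are hypothetical and the choice of sorting is an assumption made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple, Union

ERROR = "error"  # hypothetical sentinel for the "error" output of F_2


def f1_sort(d: Sequence[int]) -> Tuple[List[int], List[int]]:
    """First execution F_1: returns (solution, trail).

    The trail is the permutation of indices that sorts d.
    """
    trail = sorted(range(len(d)), key=lambda i: d[i])
    return [d[i] for i in trail], trail


def f2_sort(d: Sequence[int], t: Sequence[int]) -> Union[List[int], str]:
    """Second execution F_2: linear time, never trusts the trail blindly."""
    n = len(d)
    if len(t) != n:
        return ERROR
    seen = [False] * n
    out = []
    for j in t:
        # The trail must be a permutation of 0..n-1.
        if not isinstance(j, int) or not 0 <= j < n or seen[j]:
            return ERROR
        seen[j] = True
        out.append(d[j])
    # The permuted input must actually be in sorted order.
    if any(out[i] > out[i + 1] for i in range(n - 1)):
        return ERROR
    return out


def certified_run(d, f1: Callable, f2: Callable):
    """Final phase: accept only if both executions agree."""
    s1, t = f1(d)
    s2 = f2(d, t)
    if s2 == ERROR or s1 != s2:
        return ERROR
    return s1
```

Whatever trail `f2_sort` receives, any list it returns is a sorted permutation of the input, which is property (2); with the trail produced by a fault-free `f1_sort` it returns the same solution as the first execution, which is property (1).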

The certification trail approach also allows for the 
detection of faults in software. As in 2-version pro- 
gramming, separate teams can write the algorithms for 
the first and second executions. Note that the speci- 
fication now must include precise information describ- 
ing the generation and use of the certification trail. 
Because of the additional data available to the sec- 
ond execution, the specifications of the two phases 
can be very different; similarly, the two algorithms 
used to implement the phases can be very different. 
This is illustrated by the convex hull example in the 
full paper. Alternatively, the two algorithms can be 
very similar, differing only in data structure manipu- 
lations. This is illustrated by the minimum spanning 
tree example considered later. When significantly dif- 
ferent algorithms are used, the probability that both 
algorithms will contain or be affected by faults which
generate matching errors should be reduced. When 
very similar algorithms are used it is sometimes possible
to save programming effort by sharing program
code. While this reduces the ability to detect errors 
in the software it does not change the ability to detect 
transient hardware errors as discussed earlier. 

Throughout this section we have assumed that our 
method is implemented with software; however, it is 
clearly possible to implement the certification trail tech- 
nique by using dedicated hardware. It is also possible 
to generalize the basic two-level hierarchy of the cer- 
tification trail approach as illustrated in Figure 1 to 
higher levels. Finally, we note that a wide variety of 
approaches to software and hardware fault tolerance
have been proposed which bear resemblances to the 
certification trail approach; we contrast our method 
to the most closely related ideas. A more comprehen- 
sive comparison appears in the full paper. 

3 Minimum Spanning Tree Example

In this section we illustrate the use of the certification 
trail method by applying it to the minimum spanning 
tree problem. Because of space limitations we have 
omitted other applications, e.g., to the Huffman tree
and the convex hull problems. It should be stressed 
here that we believe the technique has wide applica- 
bility and these problems were chosen simply for illus- 
tration. 

The minimum spanning tree problem has been ex- 
amined extensively in the literature and an historical 
survey is given in [11]. Our certification trail approach 
is applied to a variant of the Prim/Dijkstra algorithm 
[19, 9] as explicated in [24]. We will begin our dis- 
cussion of the application of the certification trail ap- 
proach to the minimum spanning tree problem with 
some preliminary definitions. 



Definition 3.1 A graph G = (V, E) consists of a vertex
set V and an edge set E. An edge is an unordered
pair of distinct vertices which we notate as,
for example, [v, w], and we say v is adjacent to w. A
path in a graph from $v_1$ to $v_k$ is a sequence of vertices
$v_1, v_2, \ldots, v_k$ such that $[v_i, v_{i+1}]$ is an edge for
$i \in \{1, \ldots, k - 1\}$. A path is a cycle if $k > 1$ and
$v_1 = v_k$. An acyclic graph is a graph which contains
no cycles. A connected graph is a graph such that for
all pairs of vertices v, w there is a path from v to w. A
tree is an acyclic and connected graph.



Definition 3.2 Let G = (V, E) be a graph and let w
be a positive rational valued function defined on E.
A subtree of G is a tree, $T(V', E')$, with $V' \subseteq V$ and
$E' \subseteq E$. We say T spans $V'$ and $V'$ is spanned by
T. If $V' = V$ then we say T is a spanning tree of G.
The weight of this tree is $\sum_{e \in E'} w(e)$. A minimum
spanning tree is a spanning tree of minimum weight.


3.0.1 Data structures and supported operations

Before we discuss the minimum spanning tree algorithm,
we must describe the properties of the principal
data structure that are required. Since many different 
data structures can be used to implement the algo- 
rithm, we initially describe abstractly the data that 
can be stored by the data structure and the operations 
that can be used to manipulate this data. The data 
consists of a set of ordered pairs. The first element in
these ordered pairs is referred to as the item number 
and the second element is called the key value. Or- 
dered pairs may be added and removed from the set; 
however, at all times, the item numbers of distinct or- 
dered pairs must be distinct. It is possible, though, 
for multiple ordered pairs to have the same key value. 

In this paper the item numbers are integers between 1
and n, inclusive. Our default convention is that i is an
item number, k is a key value and h is a set of ordered
pairs. A total ordering on the pairs of a set can be
defined lexicographically as follows: (i, k) < (i', k') iff
k < k' or (k = k' and i < i'). Our data structure
should support a subset of the following operations.

member(i, h) returns a boolean value of true if h contains
an ordered pair with item number i, otherwise
returns false.

insert(i, k, h) adds the ordered pair (i, k) to the set h.

delete(i, h) deletes the unique ordered pair with item
number i from h.

changekey(i, k, h) is executed only when there is an
ordered pair with item number i in h. This pair
is replaced by (i, k).

deletemin(h) returns the ordered pair which is smallest
according to the total order defined above
and deletes this pair. If h is the empty set then
the token “empty” is returned.

predecessor(i, h) returns the item number of the ordered
pair which immediately precedes the pair
with item number i in the total order. If there
is no predecessor then the token “smallest” is
returned.
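
The intended behaviour of these operations can be pinned down with a deliberately naive reference model. The sketch below is not one of the two implementations used in this paper (it re-sorts on demand and is therefore slow); the class name `PairSet` and the token constants are hypothetical, and the set h is the object itself rather than a parameter.

```python
from typing import Dict, List, Tuple, Union

EMPTY = "empty"        # token returned by deletemin on an empty set
SMALLEST = "smallest"  # token returned when there is no predecessor


class PairSet:
    """Naive model of the set h, stored as item number -> key value."""

    def __init__(self) -> None:
        self.key: Dict[int, float] = {}

    def _ordered(self) -> List[int]:
        # Total order: (i, k) < (i', k') iff k < k' or (k = k' and i < i').
        return sorted(self.key, key=lambda i: (self.key[i], i))

    def member(self, i: int) -> bool:
        return i in self.key

    def insert(self, i: int, k: float) -> None:
        assert i not in self.key, "item numbers must stay distinct"
        self.key[i] = k

    def delete(self, i: int) -> None:
        del self.key[i]

    def changekey(self, i: int, k: float) -> None:
        assert i in self.key, "changekey requires the item to be present"
        self.key[i] = k

    def deletemin(self) -> Union[Tuple[int, float], str]:
        if not self.key:
            return EMPTY
        i = self._ordered()[0]
        return i, self.key.pop(i)

    def predecessor(self, i: int) -> Union[int, str]:
        order = self._ordered()
        pos = order.index(i)
        return order[pos - 1] if pos > 0 else SMALLEST
```

Any faster implementation, such as the two described later in this section, should agree with this model on every sequence of operations.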

Many different types and combinations of data struc- 
tures can be used to support these operations effi- 
ciently. In our case, we will actually use two different 
data structure methods to support these operations. 
One method will be used in the first execution of the
algorithm and another, faster and simpler, method will 
be used in the second execution. The second method 
relies on a trail of data which is output by the first 
execution. 

3.0.2 MINSPAN algorithm

Before discussing precise implementation details for 
these methods we present the overall algorithm used 
in both executions. Pidgin code for this algorithm appears
below. In addition, Figure 2 illustrates the execution
of the algorithm on a sample graph and the table
below records the data structure operations the algorithm
must perform when run on the sample graph.
The first column of the table gives the operations, except
member, with the parameter h dropped to
reduce clutter. The second column gives the evolving
contents of h. The third column records the ordered
pair deleted by the deletemin operation. The fourth
column records the certification trail corresponding to 
these operations and is further discussed below. 

The algorithm uses a “greedy” method to “grow”
a minimum spanning tree. The algorithm starts by
choosing an arbitrary vertex from which to grow the
tree. During each iteration of the algorithm a new
edge is added to the tree being constructed. Thus, the
set of vertices spanned by the tree increases by exactly
one vertex for each iteration. The edge which is added
to the tree is the one with the smallest weight among the
edges joining a spanned vertex to an unspanned vertex.
Figure 2 shows this process in action. Figure 2(a) shows
the input graph, Figures 2(b) through 2(e) show sev- 
eral stages of the tree growth and Figure 2(f) shows 
the final output of the minimum spanning tree. The 
solid edges in Figures 2(b) through 2(e) represent the 
current tree and the dotted edges represent candidates 
for addition to the tree. 

To efficiently find the edge to add to the current
tree the algorithm uses the data structure operations
described above. As soon as a vertex, say v, is adjacent
to some vertex which is currently spanned it is
inserted in the set h. The key value for v is the weight
of the minimum weight edge between v and some vertex
spanned by the current tree. The array element
prefer(v) is used to keep track of this minimum weight
edge. As the tree grows, information is updated by operations
such as insert(i, k, h) and changekey(i, k, h).
The deletemin(h) operation is used to select the next
vertex to add to the span of the current tree. Note,
the algorithm does not explicitly keep a set of edges
representing the current tree. Implicitly, however, if
(v, k) is returned by deletemin then prefer(v) is added
to the current tree.

Figure 2: Example for minimum spanning tree algorithm.


3.0.3 First execution of MINSPAN 

In the first execution of the algorithm, the MINSPAN
code is used and the principal data structure is implemented
with a balanced search tree such as an AVL
tree [1], a red-black tree [12], or a B-tree [5]. In addition,
an array of pointers indexed from 1 to n is used.
The balanced search tree stores the ordered pairs in h
and is based on the total order described earlier. The
array of pointers is initially all nil. For each item i,
the ith pointer of the array is used to point to the location
of the ordered pair with item number i in the
balanced search tree. If there is no such ordered pair
in the tree then the ith pointer is nil. This array allows
rapid execution of operations such as member(i, h) and
delete(i, h).

Algorithm MINSPAN(G, weight)
Input: Connected graph G = (V, E) where V = {1, ..., n},
       with edge weights.
Output: Spanning tree of G which has minimum weight

 1  CHOOSE root ∈ V
 2  FOR ALL u ∈ V, key(u) := ∞ END FOR
 3  h := ∅; v := root
 4  WHILE v ≠ empty DO
 5    key(v) := −∞
 6    FOR EACH [v, w] ∈ E DO
 7      IF weight([v, w]) < key(w) THEN
 8        key(w) := weight([v, w]); prefer(w) := [v, w]
 9        IF member(w, h) THEN changekey(w, key(w), h)
10        ELSE insert(w, key(w), h) END IF
11      END IF
12    END FOR
13    (v, k) := deletemin(h)
14  END WHILE
15  FOR ALL u ∈ V − {root}, OUTPUT(prefer(u))
END MINSPAN

Figure 3: Code for MINSPAN Algorithm

The certification trail is generated during the first
execution as follows: When CHOOSE root ∈ V is executed
in the first step, the vertex which is chosen is output.
Also, each time insert(i, k, h) or changekey(i, k, h)
is executed, predecessor(i, h) is executed afterwards,
and the answer returned is output. This is illustrated
in the column labeled “Trail” in Table 1.
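
The first execution, including trail generation, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: a sorted Python list maintained with `bisect` stands in for the balanced search tree (so individual operations here are not logarithmic), a set stands in for the array of pointers, and the function name `minspan_first` and its argument format are hypothetical.

```python
import bisect
import math
from typing import Dict, List, Tuple

SMALLEST = "smallest"


def minspan_first(n: int, edges: List[Tuple[int, int, float]], root: int = 1):
    """First execution of MINSPAN on vertices 1..n.

    edges holds (v, w, weight) for each undirected edge [v, w] of a
    connected graph. Returns (tree edges, certification trail).
    """
    adj: Dict[int, List[Tuple[int, float]]] = {u: [] for u in range(1, n + 1)}
    for v, w, wt in edges:
        adj[v].append((w, wt))
        adj[w].append((v, wt))

    key = {u: math.inf for u in adj}
    prefer: Dict[int, Tuple[int, int]] = {}
    h: List[Tuple[float, int]] = []  # pairs stored as (key value, item number)
    in_h = set()                     # stand-in for the array of pointers
    trail: list = [root]             # CHOOSE outputs the chosen root

    def place(i: int, k: float) -> None:
        pos = bisect.bisect_left(h, (k, i))
        h.insert(pos, (k, i))
        in_h.add(i)
        # predecessor(i, h) is executed afterwards and its answer is output.
        trail.append(h[pos - 1][1] if pos > 0 else SMALLEST)

    v = root
    while v is not None:
        key[v] = -math.inf
        for w, wt in adj[v]:
            if wt < key[w]:
                if w in in_h:              # changekey: drop the old pair first
                    h.remove((key[w], w))
                key[w], prefer[w] = wt, (v, w)
                place(w, wt)
        if h:
            _, v = h.pop(0)                # deletemin
            in_h.discard(v)
        else:
            v = None                       # deletemin returned "empty"
    return [prefer[u] for u in range(1, n + 1) if u != root], trail
```

Storing pairs as (key value, item number) lets Python's built-in tuple comparison realize the total order defined earlier.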

Operation          Set of Ordered Pairs           Trail
insert(2,200)      (2,200)                        smallest
insert(6,500)      (2,200), (6,500)               2
deletemin          (6,500)
insert(3,800)      (6,500), (3,800)               6
changekey(6,450)   (6,450), (3,800)               smallest
insert(7,505)      (6,450), (7,505), (3,800)      6
deletemin          (7,505), (3,800)
insert(5,250)      (5,250), (7,505), (3,800)      smallest
changekey(7,495)   (5,250), (7,495), (3,800)      5
deletemin          (7,495), (3,800)
changekey(3,350)   (3,350), (7,495)               smallest
insert(4,700)      (3,350), (7,495), (4,700)      7
deletemin          (7,495), (4,700)
changekey(4,650)   (7,495), (4,650)               7
deletemin          (4,650)
deletemin
deletemin

Table 1: Data structure operations and certification
trail for MINSPAN

3.0.4 Second execution of MINSPAN

The second execution of the algorithm also uses the
MINSPAN code; however, the CHOOSE construct and
the data structure operations are implemented differently
than in the first execution. The CHOOSE is
performed by simply reading the first element of the
certification trail. This guarantees the same choice of
a starting vertex is made in both executions. Figure 4
depicts the principal data structure used which we call
an indexed linked list. The array is indexed from 1 to n
and contains pointers to a singly linked list which represents
the current contents of h. Each element in the
list stores an ordered pair in h except the head of the
list which contains the special ordered pair (0, -INF).
The list is organized such that a traversal from the
head gives the sorted ordering of the current contents
of h from smallest to largest. The ith element of the
array points to the node containing the ordered pair
with the item number i if it is present in h; otherwise,
the pointer is nil. The 0th element of the array points
to the node containing (0, -INF). Initially, the array
contains nil pointers except the 0th element. We now
show how to implement the data structure operations.
To perform insert(i, k, h), it is necessary to read
the next value in the certification trail. This value,
say j, is the item number of the ordered pair which is
the predecessor of (i, k) in the current contents of h.
A new linked list node is allocated and the trail information
is used to insert the node into the data structure.
Specifically, the jth array pointer is traversed
to a node in the linked list, say Y. (If j = “smallest”
then the 0th array pointer is traversed.) The new node
is inserted in the list just after node Y and before the
next node in the linked list (if there is one). The data
field in the new node is set to (i, k) and the ith pointer
of the array is set to point to the new node. Figure
4 shows the insertion of (7,505) into the data structure
given that the certification trail value is 6. Figure
4(a) is before the insertion and Figure 4(b) is after the
insertion.

When the insert operation is performed, some checks
must be conducted. First, the ith array pointer must
be nil before the operation is performed. Second, the
sorted order of the pairs stored in the linked list must
be preserved after the operation. That is, if (i', k') is
stored in the node before (i, k) in the linked list and
(i'', k'') is stored after (i, k), then (i', k') < (i, k) <
(i'', k'') must hold in the total order. If either of these
checks fails then execution halts and “error” is output.

To perform delete(i, h) the ith array pointer is traversed
and the node found is deleted from the linked
list. Next, the ith array pointer is set to nil. Figure
4 shows the deletion of item number 7 if one considers
Figure 4(b) as depicting the data structure before
the operation and Figure 4(a) depicting it afterwards.
When the delete operation is performed one check is
made. If the ith array pointer is nil before the operation
then the execution halts and “error” is output.

To perform changekey(i, k, h) it suffices to perform
delete(i, h) followed by insert(i, k, h). Note, this means
the next item in the certification trail is read. Also,
the checks associated with both of these operations
are performed and the execution halts with “error”
output if any check fails.

To perform deletemin(h) the 0th array pointer is
traversed to the head of the list and the next node
in the list is accessed. If there is no such node then
“empty” is returned and the operation is complete.
Otherwise, suppose the node is Y and suppose it contains
the ordered pair (i, k), then the node Y is deleted
from the list, the ith array pointer is set to nil, and
(i, k) is returned.

Lastly, to perform member(i, h) the ith array pointer
is examined. If it is nil then false is returned, otherwise,
true is returned. The predecessor(i, h) operation
is not used in the second execution.
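
The operations of the second execution, together with their checks, can be sketched as follows. This is a hedged illustration, not the authors' code. Two points are assumptions of the sketch: each node keeps a `prev` pointer so that unlinking is constant time (the text describes a singly linked list and does not spell out the unlinking step), and a trail value that does not name a stored pair is treated as an error. The names `IndexedLinkedList`, `Node` and `TrailError` are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple, Union

EMPTY = "empty"
SMALLEST = "smallest"


class TrailError(Exception):
    """Stands in for halting the execution and outputting "error"."""


class Node:
    def __init__(self, item: int, key: float) -> None:
        self.item, self.key = item, key
        self.prev: Optional["Node"] = None
        self.next: Optional["Node"] = None


class IndexedLinkedList:
    def __init__(self, n: int, trail: Iterable) -> None:
        self.ptr: List[Optional[Node]] = [None] * (n + 1)
        self.ptr[0] = Node(0, float("-inf"))  # head holds (0, -INF)
        self.trail = iter(trail)

    @staticmethod
    def _less(a: Node, b: Node) -> bool:
        # (i, k) < (i', k') iff k < k' or (k = k' and i < i')
        return (a.key, a.item) < (b.key, b.item)

    def _unlink(self, node: Node) -> None:
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        self.ptr[node.item] = None

    def member(self, i: int) -> bool:
        return self.ptr[i] is not None

    def insert(self, i: int, k: float) -> None:
        j = next(self.trail, None)  # read the next trail value
        if j == SMALLEST:
            j = 0
        if self.ptr[i] is not None:  # first check: ith pointer must be nil
            raise TrailError("item already present")
        # Assumed extra check: j must name a pair currently in the list.
        if not isinstance(j, int) or not 0 <= j < len(self.ptr) or self.ptr[j] is None:
            raise TrailError("trail value does not name a stored pair")
        y, node = self.ptr[j], Node(i, k)
        # Second check: sorted order must be preserved around the new node.
        if not self._less(y, node) or (y.next is not None and not self._less(node, y.next)):
            raise TrailError("sorted order would be violated")
        node.prev, node.next = y, y.next
        if y.next is not None:
            y.next.prev = node
        y.next = node
        self.ptr[i] = node

    def delete(self, i: int) -> None:
        if self.ptr[i] is None:
            raise TrailError("item not present")
        self._unlink(self.ptr[i])

    def changekey(self, i: int, k: float) -> None:
        self.delete(i)
        self.insert(i, k)  # consumes the next trail value

    def deletemin(self) -> Union[Tuple[int, float], str]:
        y = self.ptr[0].next
        if y is None:
            return EMPTY
        self._unlink(y)
        return y.item, y.key
```

Replaying the operations of Table 1 against this structure with the trail values from its third column reproduces the deleted pairs in order, while a trail that names the wrong predecessor raises `TrailError` instead of silently corrupting the list.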

This completes the description of the second exe- 
cution. To show that what we have described is a cor- 
rect implementation of the certification trail method 
requires a proof. The proof has several parts of varying 
difficulty. First, one must show that if the first execu- 
tion is fault-free then it outputs a minimum spanning 
tree. Second, one must show that if the first and sec- 
ond executions are fault-free then they both output 
the same minimum spanning tree. Neither of these parts of
the proof is difficult to show.

Figure 4: Example of the data structure used in the
second execution of MINSPAN.

The third, more subtle, part of the proof deals with
the situation in which only the second execution is 
fault-free. This means an incorrect certification trail 
may be generated in the first execution. In this case, 
we must show that the second execution outputs ei- 
ther the correct minimum spanning tree or “error”. 
The checks that were described above have been care- 
fully designed to assure precisely this property by de- 
tecting any errors that would prevent the execution 
from generating the correct output. Because of space 
restrictions we will not give the proof here. 



3.0.5 Time complexity comparisons of the two 
executions 


In the first execution each data structure operation
can be performed in O(log(n)) time where |V| = n.
There are at most O(m) such operations and O(m)
additional time overhead where |E| = m. Thus, the
first execution can be performed in O(m log(n)) time. We
note that this algorithm does not achieve the fastest
known asymptotic time complexity which appears in
[10]. However, the algorithm we have presented has a
significantly smaller constant of proportionality which
makes it competitive for reasonably sized graphs. In
addition, it provides us with a relatively simple and
illustrative example of the use of a certification trail.
It should be mentioned that we have developed a more
complex certification trail solution for an asymptotically
faster minimum spanning tree algorithm which
uses Fibonacci heaps.
In the second execution each data structure operation
can be performed in O(1) time. There are still at
most O(m) such operations and O(m) additional time
overhead. Hence, the second execution can be performed
in O(m) time. In other words, because of the
availability of the certification trail, the second execution
is performed in linear time. There are no known
O(m) time algorithms for the minimum spanning tree
problem. Komlós was able to show that O(m) comparisons
suffice to find the minimum spanning tree.
However, there is no known O(m) time algorithm to
actually find and perform these comparisons. Even
the related “verification” problem has no known linear
time solution. In the verification problem the input
consists of an edge weighted graph and a subtree. The
output is “yes” if the subtree is the minimum spanning
tree and “no” otherwise. The best known algorithm
for this problem was created by Tarjan [25] and has
the nonlinear time complexity of O(m α(m, n)), where
α(m, n) is a functional inverse of Ackermann's function.
The fact that the data in a certification trail enables
a minimum spanning tree to be found in linear time
is, we believe, intriguing, significant, and indicative of
the great promise of the certification trail technique.

3.1 Concurrency of Executions 

In some cases, it is possible to start the second execu- 
tion before the first execution has terminated. This is 
a highly desirable capability when additional hardware 
is available to run the second execution (for example, 
with multiprocessor machines, or machines with co- 
processors or hardware monitors). 

In the case of the minimum spanning tree prob- 
lem, the two executions can be run concurrently. It 
is only necessary for the second execution to read the 
certification trail as it is generated - one item number 
at a time. Thus there is a slight time lag in the second
execution. This potential for concurrency has been
found in other problems we have examined, e.g., the
Huffman tree problem. 

An additional opportunity for overlapping execu- 
tion occurs when the system has a dedicated compara- 
tor. In this case it is sometimes possible for the two 
executions to send their output to the comparator as
they generate it. For example, this can be done in the 
minimum spanning tree problem where the edges of 
the tree can be sent individually as they are discov- 
ered by both executions. 


4 Comparison of Techniques 

The certification trail approach, whether implemented 
in hardware or software or some combination thereof, 
has resemblances with other fault tolerant techniques 
that have been previously proposed and examined, but 
in each case there are significant and fundamental dis- 
tinctions. These distinctions are primarily related to 
the generation and character of the certification trail 
and the manner in which the secondary algorithm or 
system uses the certification trail to indicate whether 
the execution of the primary system or algorithm was 
in error and/or to produce an output to be compared 
with that of the primary system. 

To begin, we compare the certification trail approach
to N-version programming [8, 4]. This approach
specifies that N different implementations of an al- 
gorithm be independently executed with subsequent 
comparison of the resulting N outputs. There is no 
relationship among the executions of the different ver- 
sions of the algorithms other than they all use the 
same input; each algorithm is executed independently 
without any information about the execution of the 
other algorithms. In marked contrast, the certification 
trail approach allows the primary system to generate a 
trail of information while executing its algorithm that 
is critical to the secondary system’s execution of its 
algorithm. In effect, N-version programming can be 
thought of relative to the certification trail approach 
as the employment of a null trail.

A software/hardware fault tolerance technique called 
the recovery block approach [20, 2, 17] uses acceptance
tests and alternative procedures to produce what is to 
be regarded as a correct output from a program. When 
using recovery blocks, a program is viewed as being
structured into blocks of operations which after exe- 
cution yield outputs which can be tested in some in- 
formal sense for correctness. The rigor, completeness, 
and nature of the acceptance test is left to the program 
designer [2]. Indeed, formal methodologies for the def- 
inition and generation of acceptance tests have thus 
far not been fully established. Regardless, the certifi- 
cation trail notion of a secondary system that receives 
the same input as the primary system and executes 
an algorithm that takes advantage of this trail to effi- 
ciently produce the correct output and/or to indicate 
that the execution of the first algorithm was correct 
does not fall into the category of an acceptance test. 

Recently Blum and Kannan[7] have defined what 
they call a program checker. A program checker is 
an algorithm which checks the output of another algo-
rithm for correctness and thus it is similar to an accep- 
tance test in a recovery block. An example of a pro- 
gram checker is the algorithm developed by Tarjan[25] 
which takes as input a graph and a supposed mini- 
mum spanning tree and indicates whether or not the 
tree actually is a minimum spanning tree. The Blum 
and Kannan checker is actually more general than this 
because it is allowed to be probabilistic in a care- 
fully specified way. There are two main differences 
between this approach and the certification trail ap- 
proach. First, a program checker may call the algo- 
rithm it is checking a polynomial number of times. In 
our approach the algorithm being checked is run once. 
Second, the checker is designed to work for a prob- 
lem and not a specific algorithm. That is, the checker 
design is based on the input/output specification of a
problem. The certification trail approach is explicitly 
algorithm oriented. In other words, a specific algo- 
rithm for a problem is modified to output a certifi- 
cation trail. This trail sometimes allows the second 
execution to be faster than any known program check- 
ers for the problem. This is the case for the minimum 
spanning tree problem. 

Space limitations preclude comparisons with the 
following other relevant techniques: watchdog proces- 
sors [18, 6], algorithm based fault tolerance [13], exe- 
cutable assertions [3]. 

5 Concluding Discussion 

We have presented a new, powerful fault tolerant com- 
puting technique called the certification trail approach. 
Our description of this technique has been only in 
terms of applications to software fault tolerance, but 
the certification trail approach can also be implemented 
with hardware. We have illustrated the certification 
trail technique by applying it to a minimum spanning 
tree algorithm. The full version of this paper includes 
applications to a Huffman tree algorithm, and a con- 
vex hull algorithm. It should be understood that the 
approach is in no way limited to these algorithms. We 
believe that our consideration of these algorithms gives 
insight into the significance and desirability of the ap- 
proach. We have found several other algorithms to 
which our techniques apply including an algorithm for
the shortest path problem and we believe the technique
will be widely applicable. We have also examined the
general problem of “certifying” data structure opera-
tions as discussed above and have proven results for
additional data structures. These results are impor-
tant because they allow the certification trail approach
to be applied to any algorithm which uses one of these 
data structures. 

In the problem discussed an asymptotic speed up 
was achieved between the first execution and the sec- 
ond execution which was greater than any constant 
factor. We note, however, even if the speed up were 
only by a constant factor, it would still make sense 
to use the technique because execution time would be 
saved. We also note that the certification trail tech-
nique can be used in conjunction with other software
fault tolerance techniques. For example, multiple al- 
gorithms can be developed which generate and read 
multiple (but different) certification trails. Further 
these algorithms could be written by separate teams of 
individuals. A general architecture for the interaction 
of these algorithms is an important research topic. For 
example, a “cascade” of algorithms numbered from 1 
to N could be designed such that algorithm i sends 
a certification trail to i + 1 which allows i + 1 to run
faster than i . When errors are detected, other ver- 
sions of algorithms can be invoked which may use an 
earlier certification trail or ignore it. The ideas devel- 
oped in recovery blocks and N-version programming 
among others could be used as guidance in exploring 
such issues.

Figure 1: Certification trail method.


output such that $(d,s) \in P$.

Definition 2.2 Let $P : D \to S$ be a problem. A solu-
tion to this problem using a certification trail consists
of two functions $F_1$ and $F_2$ with the following domains
and ranges: $F_1 : D \to S \times T$ and $F_2 : D \times T \to
S \cup \{\text{error}\}$. $T$ is the set of certification trails. The
functions must satisfy the following two properties:

(1) for all $d \in D$ there exists $s \in S$ and
there exists $t \in T$ such that
$F_1(d) = (s,t)$ and $F_2(d,t) = s$ and $(d,s) \in P$

(2) for all $d \in D$ and for all $t \in T$
either ($F_2(d,t) = s$ and $(d,s) \in P$) or
$F_2(d,t) = \text{error}$.

We also require that $F_1$ and $F_2$ be implemented
so that they map elements which are not in their re-
spective domains to the error symbol. The definitions
above assure that the error-detection capability of the
certification-trail approach is similar to that obtained
with the simple time-redundancy approach discussed
earlier. (That is, if transient hardware faults occur
during only one of the executions then either an er-
ror will be detected or the output will be correct.) It
should be further noted, however, the examples to be
considered will indicate that this new approach can
save overall execution time.

Observant readers of our earlier paper [11] in which
we introduced the notion of a certification trail might
have noticed that our certification-trail solution for the
min-spanning tree was generalizable. The generalized
technique allows one to generate a certification trail
for many algorithms which use a balanced binary tree
data structure. However, the technique relies on the
efficient execution of the predecessor operation and
data structures such as heaps cannot execute
the predecessor operation efficiently. The techniques
described in this paper are even more general and pow-
erful and they do apply to heaps.

The degree of diversity or independence achieved
when using certification trails depends on how they
are used. A fuller discussion of this and of the re-
lationship between certification trails and other ap-
proaches to software fault tolerance is contained in the
expanded version of [11]. This current paper presents 
asymptotic analysis which shows that the certification- 
trail approach is desirable even when the overhead of 
generating the certification-trail is included. We are 
currently working on an experimental analysis of the 
method and initial results are quite promising. 


3 Answer-Validation Problem for 
Abstract Data Types 

Our general approach to applying certification trails 
uses the concept of an abstract data type. Some exam- 
ples of abstract data types are given later in this paper. 
Here we mention some important common properties 
and give a short illustration. Each abstract data type 
has a well defined data object or set of data objects, 
and each abstract data type has a carefully defined fi- 
nite collection of operations that can be performed on 
its data object(s). Each operation takes a finite num-
ber of arguments (possibly zero), and some but not
all operations return answers. An example of an ab- 
stract data type is a priority queue. The data object 
for a priority queue is an ordered pair of the form (i,k)
where i is an item number and k is a key value. A pri- 
ority queue has two operations: insert(i,k) and delmin. 
The insert operation has two arguments: item number 
i and key value k. The insert operation does not return 
an answer. The delmin operation has no arguments, 
but it does return an answer. The precise semantics 
of these operations are given later in this paper.

For each abstract data type we define an answer- 
validation problem. Intuitively, the answer validation 
problem consists of checking the correctness of a se- 
quence of supposed answers to a sequence of opera- 
tions performed on the abstract data type. More for-
mally, the input to the answer-validation problem is
a sequence of operations on the abstract data type 
together with the arguments of each operation. In ad- 
dition, the sequence contains the supposed answers for 
each of the operations which return answers. In par- 
ticular, each supposed answer is paired with the oper- 
ation that is supposed to return it. Examples of such 
inputs are given in the columns labelled “Operation” 
and “Answer” of table 1 and table 2. 

The output for the answer-validation problem is
the word “correct” if the answers given in the input
match the answers that would be generated by actually
performing the operations. The output is the word
“incorrect” if the answers do not match. It is also
useful to allow the output word to say “ill-formed”.
This output is used if the sequence of operations is ill-
formed, e.g., an operation has too many arguments or
an argument refers to an inappropriate object.

additional software. It is particularly valuable for de-
tecting errors caused by transient fault phenomena. If
such faults cause an error during only one of the ex-
ecutions then either the error will be detected or the
output will be correct. The second possibility, of unde-
tected faults, occurs when the output of the execution
is unaffected by the faults.

The certification-trail technique is designed to ob- 
tain similar types of error-detection capabilities but 
expend fewer resources. The central idea, as illus- 
trated in Figure 1, is to modify the first algorithm 
so that it leaves behind a trail of data which we call a
certification trail. This data is chosen so that it can al-
low the second algorithm to execute more quickly
and/or have a simpler structure than the first algo- 
rithm. As above, the outputs of the two executions 
are compared and are considered correct only if they 
agree. Note, however, we must be careful in defining 
this method or else its error detection capability might 
be reduced by the introduction of data dependency 
between the two algorithm executions. For example, 
suppose the first algorithm execution contains an error 
which causes an incorrect output and an incorrect trail
of data to be generated. Further suppose that no error 
occurs during the execution of the second algorithm. It 
still appears possible that the execution of the second 
algorithm might use the incorrect trail to generate an 
incorrect output which matches the incorrect output 
given by the execution of the first algorithm. Intu-
itively, the second execution would be “fooled” by the
data left behind by the first execution. The definitions 
we give below exclude this possibility. They demand 
that the second execution either generate a correct an- 
swer or signal that an error has been detected in the 
data trail. 


Certification trails are a recently introduced and promis-
ing approach to fault detection and fault tolerance [11].
In this paper, we significantly generalize the applica-
bility of the certification trail technique. Previously,
certification trails had to be customized to each algo-
rithm application, but here we develop trails appro-
priate to wide classes of algorithms. These certifica-
tion trails are based on common data-structure oper-
ations such as those carried out using balanced binary
trees and heaps. Any algorithm using these sets of 
operations can therefore employ the certification trail 
method to achieve software fault tolerance. To exem-
plify the scope of the generalization of the certification
trail technique provided in this paper, constructions of 
trails for abstract data types such as priority queues 
and union-find structures will be given. These trails 
are applicable to any data-structure implementation of 
the abstract data type. It will also be shown that these 
ideas lead naturally to monitors for data-structure op- 
erations. 

Keywords: Software fault tolerance, error monitor- 
ing, certification trails, design diversity, data struc- 
tures. 


1 Introduction 

In this paper we significantly generalize the novel and 
powerful certification-trail technique for achieving fault 
tolerance in systems that was introduced in [11]. Al- 
though applicable to both hardware and software, we 
restrict our discussion of the certification-trail tech- 
nique in the following to software fault tolerance. To 
explain the essence of the certification-trail technique
for software fault tolerance, we will first discuss a sim- 
pler fault-tolerant software method. In this method 
the specification of a problem is given and an algo- 
rithm to solve it is constructed. This algorithm is ex- 
ecuted on an input and the output is stored. Next, 
the same algorithm is executed again on the same in- 
put and the output is compared to the earlier output. 
If the outputs differ then an error is indicated, other- 
wise the output is accepted as correct. This software 
fault tolerance method requires additional time, so- 
called time redundancy [8, 10]; however, it requires no 

1 Research partially supported by NSF Grants CCR-8910669
and CCR-6908092. 

3 Research partially supported by NASA Grant NSG 1442. 


2 Formal Definition of a Certi- 
fication Trail 

In this section we will give a formal definition of a 
certification trail and discuss some aspects of its real- 
izations and uses. 

Definition 2.1 A problem P is formalized as a rela- 
tion, i.e., a set of ordered pairs. Let D be the domain 
(that is, the set of inputs) of the relation P and let 
S be the range (that is, the set of solutions) for the 
problem. We say an algorithm A solves a problem P 
iff for all $d \in D$ when d is input to A then an $s \in S$ is

The answer-validation problem is similar to the
idea of an acceptance test which is used in the recovery- 
block approach [9, 2] to software fault tolerance. The 
main difference is that an answer-validation problem is
dependent upon a sequence of answers, not just an in-
dividual answer. Hence, if an incorrect answer appears
in the sequence, it may not be detected immediately. 
It is guaranteed, however, that an incorrect answer 
will be detected at some point during the processing 
of the entire sequence. By allowing for this latency in 
detection, it is possible to create a much more efficient 
procedure for solving the answer-validation problem.

In this paper we shall solve the validation problem 
for two abstract data types. In the full version of this 
paper we solve the answer-validation problem for more
general data types [12].

The most important aspect of the answer-validation 
problem is that it is often possible to check the cor- 
rectness of the answers to a sequence of operations 
much more quickly than actually calculating what the 
answers should be from scratch. In other words, the 
answer-validation problem has a smaller time complex- 
ity than the original abstract-data-type problem. For 
example, to calculate the answers to a sequence of n 
priority-queue operations takes Ω(n log(n)) time, how-
ever it is possible to check the correctness of the an-
swers in only O(n) time. This speedup is very useful
in fault-detection applications. 

It is possible to run an answer-validation algorithm 
for some abstract data type concurrently with some 
algorithm which uses the abstract data type. The 
answer-validation algorithm could act as a monitor 
making sure that all interactions with the abstract 
data type are handled correctly. This is valuable be- 
cause many algorithms spend a large fraction of their 
time operating on abstract data types. Note, the over- 
head of this monitor is less than the overhead of ac- 
tually performing the data-type operations a second 
time. 

One possible application of the answer-validation 
problem occurs when it is used in conjunction with a 
repairable data structure which allows for repair but 
does not automatically attempt to detect faults [16]. 
Suppose an abstract data type is implemented with 
a repairable data structure. One can use an answer- 
validation procedure to detect errors in the answers 
generated by the abstract data type. When an er- 
ror is detected, a repair of the data structure can be 
attempted. In some cases, recovery and continued ex- 
ecution will be possible. 

In the next section, we will show how to create cer-
tification trails for programs which use abstract data
types when those data types have efficient solutions 
for their answer-validation problems. 


4 Schema for using Certification 
Trails 

Suppose that we have developed an efficient solution to 
the answer-validation problem for some abstract data
type. By efficient we mean the time complexity of
the answer-validation problem is smaller than the time
complexity of the original abstract-data-type problem.
Further, suppose that we wish to run an algorithm, 
say A, which uses that abstract data type. To apply 
the certification trail method we can use the following 
schema to yield the two executions: 

First Execution: 

Execute algorithm A. 

Each time an abstract-data-type operation is performed,
append to the certification trail the identity of the op- 
eration, the arguments and the answer. 

Second execution: 

Phase One: 

Validate the correctness of the operations and sup-
posed answers given in the certification trail. If the
validation returns “incorrect” or “ill-formed” then out-
put “error” and stop. Otherwise, continue.

Phase Two: 

Execute algorithm A. 

Each time an abstract-data-type operation is performed, 
read the next entry in the certification trail. Make sure 
that the operation and the arguments in the certifica- 
tion trail agree with those requested in the algorithm. 
If not, output “error” and stop. Otherwise, use the
answer given in the certification trail and continue. 

In the final step the outputs from the two execu-
tions are compared and the output is accepted or an er-
ror is signaled. This schema can yield execution times
which are significantly faster than the execution time
obtained by running algorithm A twice, yet these two
methods give similar fault detection capabilities. That
is, if transient hardware faults occur during only one
of the executions then either an error will be detected
or the output will be correct. Note, the first execution
can be slower than a simple execution of algorithm
A since it must output a certification trail. However,
the second execution can be significantly faster than
a simple execution of the algorithm since the interac-
tions with the abstract data type take less time overall.
The net effect can be a major speedup.
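
The schema can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: algorithm A is taken to be sorting by repeated deletemin, and the names RecordingQueue, ReplayQueue, TrailError, sort_with and second_execution are all hypothetical. The validate parameter stands in for an answer-validation algorithm for the abstract data type and is assumed to return “correct”, “incorrect” or “ill-formed”.

```python
import heapq


class TrailError(Exception):
    """Raised when the trail disagrees with what the algorithm requests."""


class RecordingQueue:
    """First execution: a real priority queue that logs every operation."""

    def __init__(self):
        self.heap = []
        self.trail = []  # entries are (operation, arguments, answer)

    def insert(self, i, k):
        heapq.heappush(self.heap, (k, i))
        self.trail.append(("insert", (i, k), None))

    def deletemin(self):
        k, i = heapq.heappop(self.heap)
        self.trail.append(("deletemin", (), (i, k)))
        return (i, k)


class ReplayQueue:
    """Second execution, phase two: answers are read from the trail."""

    def __init__(self, trail):
        self.entries = iter(trail)

    def _next(self, op, args):
        entry = next(self.entries, None)
        # The operation and arguments must agree with the trail entry.
        if entry is None or entry[0] != op or entry[1] != args:
            raise TrailError(op)
        return entry[2]

    def insert(self, i, k):
        self._next("insert", (i, k))

    def deletemin(self):
        return self._next("deletemin", ())


def sort_with(queue, values):
    """Algorithm A (an assumed example): sort via insert and deletemin."""
    for i, k in enumerate(values, start=1):
        queue.insert(i, k)
    return [queue.deletemin()[1] for _ in values]


def second_execution(values, trail, validate):
    # Phase one: validate the operations and supposed answers in the trail.
    if validate(trail) != "correct":
        return "error"
    # Phase two: rerun algorithm A, taking answers from the trail.
    try:
        return sort_with(ReplayQueue(trail), values)
    except TrailError:
        return "error"
```

In use, the first execution would run sort_with(RecordingQueue(), values) and keep the queue's trail; the final step compares that output with the result of second_execution and accepts it only if the two agree and neither is “error”.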

Suppose an algorithm uses multiple abstract data 
types and suppose there are efficient answer-validation 
algorithms for each of these abstract data types. It is 
easy to see how our method generalizes. We can leave 
behind a generalized certification trail which consists 
of a separate certification trail for each of the abstract
data types. The effect on the speedup of the second
execution will be cumulative.

Figure 2: Union Tree with Find Edges


5 Answer Validation for Disjoint- 
Set Union 

As our first example we will discuss the disjoint-set
union problem. This problem concerns a dynamic col- 
lection of sets in which pairs of sets can be combined 
to yield new sets. The underlying universe of set el- 
ements consists of the integers from 1 to n inclusive. 
Also, the universe of set names consists of the integers 
from 1 to n inclusive. There are three operations that 
can be performed: 


create(A,x) creates a singleton set named A which 
contains element x. Since sets must be disjoint we 
require that x not already be in some set. 

union(A,B) creates a new set which is the union
of the sets named A and B. This new set is called A
and the set named B becomes undefined. It is required
that the sets named A and B are originally defined and
are disjoint.

find(x) returns the name of the set which contains
element x. It is required that x be a member of some
unique set.

If an operation violates one of the requirements 
described above then it is considered to be ill-formed. 
Also, if an operation has the wrong number or type of
arguments it is considered to be ill-formed.

In table 1 we give an example of a sequence of
disjoint-set-union operations together with the answers
for find operations. In addition, the collection of sets
is depicted as it is changed by the operations. For sim-
plicity, in this example each set name corresponds to
the integer originally contained in the set when it is
created. Sets are listed by first giving the name of the
set followed by a colon and then the contents of the
set.

Operation      Answer   Status of sets

create(1,1)             1:{1}
create(2,2)             1:{1}, 2:{2}
union(1,2)              1:{1,2}
find(2)        1        1:{1,2}
create(3,3)             1:{1,2}, 3:{3}
create(4,4)             1:{1,2}, 3:{3}, 4:{4}
create(5,5)             1:{1,2}, 3:{3}, 4:{4}, 5:{5}
union(5,3)              1:{1,2}, 4:{4}, 5:{3,5}
union(5,1)              4:{4}, 5:{1,2,3,5}
find(2)        5        4:{4}, 5:{1,2,3,5}
find(5)        5        4:{4}, 5:{1,2,3,5}
create(6,6)             4:{4}, 5:{1,2,3,5}, 6:{6}
union(4,6)              4:{4,6}, 5:{1,2,3,5}
create(7,7)             4:{4,6}, 5:{1,2,3,5}, 7:{7}
union(4,7)              4:{4,6,7}, 5:{1,2,3,5}
find(6)        4        4:{4,6,7}, 5:{1,2,3,5}

Table 1: Sequence of operations for a Disjoint Set
Union

The disjoint-set-union problem is a classic problem
which has many applications [4] such as the off-line
min problem, connected components, least-common
ancestors, and equivalence of finite automata. Of par-
ticular interest is the time-complexity of performing a
sequence of operations. Let us say the total number of
operations is m, which is assumed to be greater than
or equal to n. Recall, n is the number of set elements
and set names.

Tarjan gave the tight upper bound of O(m α(m,n))
[13, 14] for this problem. The α refers to the inverse
of Ackermann’s function which is a very slowly grow-
ing function. His solution and earlier solutions used
a path-compression heuristic [15]. Fredman and Saks
gave a lower bound of Ω(m α(m,n)) [5] in a general
cell-probe model. Gabow and Tarjan show how to
solve some important special cases of this problem in
O(m) time [6].

We now consider the answer-validation problem for 
the disjoint-set-union data type. We will show that 
this problem can be solved in O(m) time where m 
is the number of operations. Note, this time com- 
plexity is superior to the complexity of actually per- 
forming the sequence of operations as discussed above. 
One method for solving this problem in O(m) time
uses the powerful techniques of Gabow and Tarjan [6]. 
However, we shall present a simpler method with a 
small constant of proportionality that is tailored to 
this problem. 

To solve this problem we will build a forest based 
on the union operations in the sequence. In addition, 
we shall add edges to this forest based on the find 
operations. As a final step we will perform a traversal 
of the forest and perform appropriate checks. The solid 
edges in figure 2 indicate the forest we would build for
the set of operations given in table 1. The dashed
edges indicate the edges we would add to the forest 
based on the find operations. 

Algorithm for Answer Validation for Disjoint-
Set Union

Input: sequence of m operations together with argu-
ments and supposed answers for the disjoint-set union
data type.

Output: “correct”, “incorrect” or “ill-formed”

Declarations: Type treenode has fields left and right.
Type treeleaf contains a list of pointers such that each
pointer points to a treenode or a treeleaf. Array activeset
is indexed by set name. Each array element is
a pointer to a treenode or a treeleaf. Array whereis is
indexed by an element number. Each array element
is a pointer to a treeleaf. Initially, all pointers are nil
and lists are null.

In the first phase of the algorithm we process each op- 
eration as it appears serially using the following rules: 

create(A,x): If activeset[A] or whereis[x] are non-nil
then output “ill-formed” and stop. Otherwise, allocate
a treeleaf and set activeset[A] and whereis[x] to the
allocated node.

union(A,B): If activeset[A] or activeset[B] are nil then
output “ill-formed” and stop. Otherwise, allocate a
treenode and set left to activeset[A] and right to
activeset[B]. Next set activeset[A] to the treenode and
set activeset[B] to nil.

find(x) A: (where A is the supposed answer to the
find.) If whereis[x] is nil then output “ill-formed” and stop.
Otherwise, whereis[x] points to some treeleaf. Call it
tleaf. If activeset[A] is nil then output “ill-formed” and stop.
Otherwise, activeset[A] points to some treeleaf or
treenode. Call it t. Add a pointer to t to the list of pointers
contained in tleaf.

In the second phase of the algorithm we shall traverse 
the structure we have built. 

Scan through the array activeset to find non-nil pointers.
It is not hard to see that each non-nil pointer points
to the root of a tree made up of nodes of type treenode
and treeleaf. The tree uses the edges in the left and right
fields of treenode.

For each such tree perform a depth-first search. When-
ever the search reaches a node of type treeleaf traverse
the list of pointers that it contains. Check that each
pointer points to a node which is currently on the stack
which is used to perform the depth-first search. This is
equivalent to checking that each pointer in the treeleaf
points to a node which is an ancestor of that treeleaf in
the tree (the treeleaf itself counts as its own ancestor,
since it is on the stack when it is reached).

If some pointer does not point to an ancestor then out-
put “incorrect” and stop. Otherwise, output “correct”
and stop.
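
The two phases above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the paper's code: operations are encoded as hypothetical tuples ("create", A, x), ("union", A, B) and ("find", x, A), where the last component of a find is the supposed answer; dictionaries replace the arrays activeset and whereis; and the guard against union(A,A) is an added assumption, since such a call violates the disjointness requirement but is not mentioned in the rules above. The on_stack list plays the role of a boolean vector indexed by node identifier.

```python
def validate_disjoint_set_union(ops):
    """Return "correct", "incorrect" or "ill-formed" for a list of operations."""
    left, right, finds = [], [], []  # parallel arrays indexed by node id

    def allocate(l=None, r=None):
        # A node with left is None is a treeleaf; otherwise it is a treenode.
        left.append(l)
        right.append(r)
        finds.append([])  # pointers added by find operations (leaves only)
        return len(left) - 1

    activeset, whereis = {}, {}

    # First phase: process the operations serially and build the forest.
    for op in ops:
        if len(op) != 3:
            return "ill-formed"
        name, a, b = op
        if name == "create":  # create(A, x)
            if a in activeset or b in whereis:
                return "ill-formed"
            activeset[a] = whereis[b] = allocate()
        elif name == "union":  # union(A, B)
            if a not in activeset or b not in activeset or a == b:
                return "ill-formed"
            activeset[a] = allocate(activeset[a], activeset.pop(b))
        elif name == "find":  # find(x) with supposed answer A
            if a not in whereis or b not in activeset:
                return "ill-formed"
            finds[whereis[a]].append(activeset[b])
        else:
            return "ill-formed"

    # Second phase: depth-first search of every tree, tracking the stack.
    on_stack = [False] * len(left)
    for root in activeset.values():
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            if leaving:
                on_stack[node] = False
                continue
            on_stack[node] = True
            stack.append((node, True))  # marker to clear the bit on exit
            if left[node] is None:
                # Every find pointer must target a node on the current path.
                if any(not on_stack[t] for t in finds[node]):
                    return "incorrect"
            else:
                stack.append((right[node], False))
                stack.append((left[node], False))
    return "correct"
```

Each node is pushed and popped a constant number of times and each find pointer is inspected once, so the sketch keeps the linear running time claimed for the algorithm, up to the cost of dictionary operations.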


Theorem 5.1 The algorithm for answer validation of
the disjoint-set-union abstract data type is correct.

Theorem 5.2 The answer validation algorithm for dis-
joint-set union has a time complexity of O(m) for pro-
cessing a sequence of m operations.

We omit the proofs of these two theorems, which overall
are not difficult to show. We comment on one aspect of im-
plementation. In the second phase of the answer vali-
dation algorithm it is necessary to determine if certain
nodes are on the stack during the tree traversal. This 
can be done efficiently as follows: First, each treen- 
ode and each treeleaf can be assigned a unique iden- 
tifier in the range 1 to m as it is allocated. Next, a 
boolean vector of size m indexed by the unique iden-
tifiers described above can be allocated. This vector
can be used to keep track of which nodes are on the
stack during tree traversal by turning bits on and off.
This modified tree traversal algorithm still takes O(m)
time. 

6 Generalized Priority Queue 

We now describe a somewhat general abstract data
type. We will solve the answer validation problem for
restricted versions of this data type. The data consists 
of a set of ordered pairs. The first element in these or- 
dered pairs is referred to as the item number and the 
second element is called the key value. Ordered pairs 
may be added and removed from the set, however, at 
all times the item numbers of distinct ordered pairs 
must be distinct. It is possible, though, for multiple 
ordered pairs to have the same key value. In this pa- 
per the item numbers are integers between 1 and n, 
inclusive. Our default convention is that i is an item 
number, k is a key value and h is a set of ordered pairs.
A total ordering on the pairs of a set can be defined
lexicographically as follows: (i,k) < (i',k') iff k < k'
or (k = k' and i < i'). The abstract data types we will
consider support a subset of the following operations.

member(i) returns a boolean value of true if the set 
contains an ordered pair with item number i, 
otherwise returns false. 

insert(i,k) adds the ordered pair (i,k) to the set. We
require that no other pair with item number i be
in the set.

delete(i) deletes the unique ordered pair with item
number i from the set. We require that a pair
with item number i be in the set initially.

changekey(i,k) is executed only when there is an or-
dered pair with item number i in the set. This
pair is replaced by (i,k).

T    Operation        Answer      Validation stack

1    insert(6,300)
2    insert(2,404)
3    insert(3,250)
4    deletemin        (3,250)     (3,250,4)
5    insert(10,248)
6    insert(12,245)
7    insert(4,260)
8    deletemin        (12,245)    (12,245,8), (3,250,4)
9    insert(13,140)
10   insert(5,142)
11   deletemin        (13,140)    (13,140,11), (12,245,8), (3,250,4)
12   deletemin        (5,142)     (5,142,12), (12,245,8), (3,250,4)
13   deletemin        (10,248)    (10,248,13), (3,250,4)
14   deletemin        (4,260)     (4,260,14)

Table 2: Sequence of Priority Queue operations illus-
trating answer validation algorithm

Each operation is time-stamped, i.e., the opera-
tions are assigned integers sequentially starting with
1 which is easy to do with a counter. The answer- 
validation algorithm uses a stack called deletestack. 
The contents of this stack are illustrated in table 2. 
The top of the stack is on the left in table 2.

Let us consider the kinds of tests that an answer- 
validation algorithm for a priority queue might per- 
form. Suppose (i,k) is the answer to some deletemin 
operation. Further, suppose (i',k') was deleted in a
previous deletemin operation. If the priority queue is
correct then either (i,k)>(i',k') or (i',k') was deleted
before (i,k) was inserted. This suggests that the time 
of insertion and deletion for elements should be recorded 
and the algorithm below does this. Unfortunately, if 
an algorithm compares an ordered pair which has been 
deleted against all previously deleted ordered pairs 
then the algorithm complexity is at least O(m^2). To
avoid this the deletestack is used. The deletestack was 
designed to allow many comparisons to be done im- 
plicitly and to reduce the complexity. 


deletemin (or deletemax) returns the ordered pair which
is smallest (or largest) according to the total or-
der defined above and deletes this pair. If the
set is empty then the token “empty” is returned.

min (or max) returns the ordered pair which is small-
est (or largest) according to the total order de-
fined above. If the set is empty then the token
“empty” is returned.

If an operation violates one of the requirements de-
scribed above then it is considered to be ill-formed.
Also, if an operation has the wrong number or type of
arguments it is considered to be ill-formed.

Many different types and combinations of data struc-
tures can be used to support different subsets of these
operations efficiently.

7 Answer Validation for Prior-
ity Queue

We will first consider the priority-queue abstract data
type which allows only two operations: insert and
deletemin. An example of a sequence of such oper-
ations appears in table 2. Many different data struc-
tures can be used to implement priority queues includ-
ing heaps [17], balanced search trees such as AVL trees
[1], red-black trees [7], or b-trees [3]. It is possible to
process a sequence of O(n) operations in O(n log(n))
time using the data structures above. Furthermore,
there is a lower bound of Ω(n log(n)) because it is pos-
sible to sort using a priority queue. Remarkably, the
answer-validation problem can be solved using only
O(n) time, as documented below.

Algorithm for Answer Validation for Priority
Queue

Input: sequence of O(n) operations together with ar-
guments and supposed answers for the priority-queue
data type.

Output: “correct”, “incorrect” or “ill-formed”

Declarations: Array called inserttime indexed by item
number. Array elements contain either “absent” or
a time-stamp. Array called keyvalue indexed by item
number. Array elements contain either “absent” or
a key value. Initially, each element in these two ar-
rays contains “absent”. Stack of ordered triples called
deletestack. Each ordered triple has the following form:
first element is an item number, second element is a
key value, and third element is a time-stamp. The stack
deletestack is initially empty.

In the first phase of the algorithm we process each op- 
eration as it appears serially using the following rules: 

Let currenttime refer to the time-stamp of the opera- 
tion being processed. 

insert(i,k): If inserttime[i] ≠ “absent” then output “ill-
formed” and stop. Otherwise, let inserttime[i] = cur-
renttime and let keyvalue[i] = k.

deletemin (i,k): (where (i,k) is the supposed answer
to the deletemin operation.) If inserttime[i] = “absent”
or keyvalue[i] ≠ k then output “ill-formed” and stop.

Otherwise, let (i',k') be the item number and key
value of the triple on the top of deletestack (if there
is one). Repeatedly pop the stack until (i,k)<(i',k') or
until deletestack is empty.

If deletestack is empty then push the triple
(i,k,currenttime) onto deletestack. Further, let
inserttime[i] = “absent” and let keyvalue[i] = “absent” and
process the next priority queue operation.

If deletestack is non-empty then let the top element
be (i',k',deletetime'). If inserttime[i] < deletetime' then
output “incorrect” and stop. Otherwise, push the
triple (i,k,currenttime) onto deletestack. Next, let
inserttime[i] = “absent” and let keyvalue[i] = “absent” and
process the next priority queue operation.

In the second phase of the algorithm we operate 
on the items which have been inserted but have never 
been deleted. 

Scan the array inserttime and for each item number 
foj which insert time [i]^ “absent" construct an ordered 
triple (i,keyvdue(i],in5crttime[i]). Call this set of or- 
dered triples remainders. 

Use a bucket sort to sort the triples in remainders by 
their time-stamps, i.e., the third element of the ordered 
triple. 

Merge the triples in remainders together with the triples 
in deletestaek so that they are all ordered by their 
time-stamps, i.e., the third element of the ordered 
triple. 

Scan the combined triples to determine if there exist
two triples which satisfy the following: inserttime[i] <
deletetime' and (i,keyvalue[i]) < (i',k'), where one triple
is from remainders and has the form (i,keyvalue[i],
inserttime[i]) and where the other triple is from
deletestack and has the form (i',k',deletetime').

If these two triples exist then output “incorrect” and
stop. Otherwise output “correct” and stop.
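
The second-phase test can be sketched in Python along the following
lines. This is a minimal sketch under stated assumptions rather than
the authors' code: the function name check_remainders and the parameter
max_time are hypothetical, time-stamps are assumed to be distinct
integers in the range 1..max_time, and pairs are assumed to be compared
by key value first and item number second. The sketch folds the bucket
sort and the merge into a single array indexed by time-stamp, and it
relies on the property that pairs on deletestack decrease from bottom
to top, so the earliest deletion after a given time carries the largest
pair among all later deletions on the stack.

```python
def check_remainders(remainders, deletestack, max_time):
    """Second-phase test on the never-deleted items.

    remainders:  triples (i, keyvalue[i], inserttime[i])
    deletestack: triples (i', k', deletetime'), bottom of the stack first
    Returns "incorrect" if some remaining item was present while a larger
    pair on deletestack was deleted, and "correct" otherwise.
    """
    # One bucket per time-stamp; this sorts and merges both sets at once.
    slots = [None] * (max_time + 1)
    for i, k, t in remainders:
        slots[t] = ("remainder", i, k)
    for i, k, t in deletestack:
        slots[t] = ("deleted", i, k)

    # Scan from the latest time-stamp to the earliest, remembering the
    # pair of the earliest deletion that lies after the current slot.
    later_deleted = None
    for t in range(max_time, 0, -1):
        entry = slots[t]
        if entry is None:
            continue
        source, i, k = entry
        if source == "deleted":
            later_deleted = (k, i)
        elif later_deleted is not None and (k, i) < later_deleted:
            # inserttime[i] < deletetime' and (i, keyvalue[i]) < (i', k')
            return "incorrect"
    return "correct"
```

As a small illustration, a remaining triple (1,3,2) together with a
stack triple (0,5,3) is reported as “incorrect”, since item 1 with key
3 was inserted at time 2 and was still present when the larger pair
(0,5) was deleted at time 3.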


Theorem 7.1 The algorithm for answer validation of 
the priority queue abstract data type is correct. 

Proof: Clearly the algorithm for answer validation
always terminates. We must show that the algorithm
outputs “correct” iff the operations together with
arguments and supposed answers are correct. Because of
space limitations we will only give a proof for the more
difficult half of this iff statement. We shall use a proof
by contradiction. Assume that the sequence of operations,
arguments and supposed answers is considered
correct by the algorithm but actually is incorrect. The
use of the array inserttime and the symbol “absent”
assures that no item is deleted when it is absent or
inserted when it is already present. The use of the array
keyvalue assures that items do not change key value
while they are present in the data type set. There is
only one remaining way in which a sequence can be
incorrect. This occurs when an ordered pair is deleted
by a deletemin operation although it does not actually
have the smallest key value.

This means there exist ordered pairs (i_1,k_1) and
(i_2,k_2) such that (i_1,k_1) > (i_2,k_2) and (i_1,k_1) is deleted
while (i_2,k_2) is present in the data type set. In addition,
we may specify that (i_1,k_1) is the largest ordered
pair deleted while (i_2,k_2) is present. Let ins_1 be the
time that i_1 was inserted and let del_1 be the time that
i_1 was deleted. Let ins_2 be the time that i_2 was
inserted and let del_2 be the time that i_2 was deleted (if
it was deleted). There are two cases.

Case 1: Suppose that (i_2,k_2) is ultimately deleted.
We know that (i_1,k_1) > (i_2,k_2) by assumption, that
del_2 > del_1 since item i_2 is deleted after item i_1, and that
ins_2 < del_1 since item i_2 was present when item i_1 was deleted.

Consider the situation when item i_2 is deleted with
a deletemin operation. The ordered triple for item i_1
must appear in deletestack just before the processing
of the i_2 deletion operation. This follows because the
triple for item i_1 can only be removed from deletestack
by a larger element and yet (i_1,k_1) refers to the largest
ordered pair deleted while (i_2,k_2) was present. Now,
since (i_1,k_1) > (i_2,k_2) the ordered triple for item i_1 will
remain in deletestack even after deletestack is popped
during the processing of the deletemin operation for
item i_2. Suppose the top of deletestack is (i_3,k_3,del_3)
after the popping.

It is easy to show that the time-stamps on deletestack
are monotonically ordered with the largest time-stamp
at the top. For this reason we know that
del_3 ≥ del_1. We noted earlier that del_1 > ins_2. Together
these give ins_2 < del_3, and when ins_2 < del_3 the
algorithm outputs “incorrect” as it processes the
deletemin operation. This contradicts
our assumption that the sequence of operations,
arguments and supposed answers was considered correct
by the algorithm.

Case 2: Suppose the ordered pair (i_2,k_2) is never
deleted. In the second phase of the algorithm the
ordered triple (i_2,k_2,ins_2) is constructed and is compared
against the ordered triples in deletestack.

The same argument that was used in case 1 above
can be used to show that the test performed in the
second phase of the algorithm would detect a problem
and cause “incorrect” to be output. This contradicts
our assumption that the sequence of operations,
arguments and supposed answers was considered correct by
the algorithm. Since both cases lead to a contradiction
our proof is complete. □



Theorem 7.2 The answer validation algorithm for the
priority queue has a time complexity of O(n) for
processing a sequence of O(n) operations.


Proof: We first analyse phase one of the algorithm.
Note that there is a constant amount of work done for
processing each single operation if we exclude the cost of
popping the deletestack. Interestingly, popping the
deletestack can take O(n) time for the processing of
a single operation. Luckily, the total amortized
complexity for popping the deletestack while processing a
sequence of O(n) operations is still only O(n). This
is true because each item which is inserted and later
deleted is placed on deletestack and is popped at most
once.
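
The amortized bound can be written out as a short calculation. The
symbols here are introduced only for this sketch: p_t is the number of
pops performed while processing the t-th operation, d is the number of
deletemin operations in a sequence of n operations, and c is an assumed
constant bounding the non-popping work per operation. Each valid
deletemin pushes exactly one triple and each triple is popped at most
once, so the total number of pops cannot exceed d.

```latex
\[
\sum_{t=1}^{n} p_t \;\le\; d \;\le\; n,
\qquad\text{so}\qquad
\sum_{t=1}^{n} \bigl(c + p_t\bigr) \;\le\; cn + n \;=\; O(n).
\]
```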

We now consider phase two. The cost of array
scanning and constructing the triples is O(n). The
cost of the bucket sort is O(n) and the cost of the
merge is also O(n). The final test can be implemented
with a simple scan with a complexity of O(n). Hence
the overall complexity is O(n). □


We have solved the answer-validation problem for
abstract data structures that support the following set
of operations: member, insert, delete, deletemin, min,
deletemax, and max. The algorithm used to solve this
problem is intricate but efficient. It requires only O(n)
time to process O(n) operations. A detailed description
of our solution, however, is beyond the scope of
this version of the paper.

The results reported in this paper significantly
generalize the applicability of the certification-trail
technique. In our previously reported work on certification
trails [11], we had to customize each algorithm
application, but we have now developed trails appropriate
to wide classes of algorithms. These certification trails
are based on common data-structure operations such
as those carried out using balanced binary trees and
heaps. Any algorithm using these sets of operations
can therefore employ the certification trail method to
achieve software fault tolerance. To express the full
generality of these ideas, we have provided constructions
of trails for abstract data types such as priority
queues and union-find structures. These trails are
applicable to any data-structure implementation of the
abstract data type. These ideas lead naturally to
monitors for data-structure operations. We are currently
working on an experimental evaluation of the approach
and initial results are promising.